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Abstract 


Early diagnosis is important for type 2 diabetes (T2D) to improve patient prognosis, pre¬ 
vent complications and reduce long-term treatment costs. We present a novel risk profiling 
approach based exclusively on health expenditure data that is available to Belgian mu¬ 
tual health insurers. We used expenditure data related to drug purchases and medical 
provisions to construct models that predict whether a patient will start glucose-lowering 
pharmacotherapy in the coming years, based on that patient’s recent medical expenditure 
history. The design and implementation of the modeling strategy are discussed in detail 
and several learning methods are benchmarked for our application. Our best performing 
model obtains between 74.9% and 76.8% area under the ROC curve, which is comparable 
to state-of-the-art risk prediction approaches for T2D based on questionnaires. In contrast 
to other methods, our approach can be implemented on a population-wide scale at virtually 
no extra operational cost. Possibly, our approach can be further improved by additional 
information about some risk factors of T2D that is unavailable in health expenditure data. 
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1. Introduction 

Type 2 diabetes mellitus (T2D) is a chronic metabolic disorder characterized by hyper¬ 
glycemia and is considered one of the main threats to human health (Zimmet et ah, 2001). 
In developed countries, T2D makes up about 85% of diabetes mellitus patients and occurs 
when either insufficient insulin is produced, the body becomes resistant to insulin or both 
(World Health Organization et ah, 1994). Prediabetes and less severe cases of T2D are ini¬ 
tially managed by lifestyle changes, specifically increasing physical exercise, dietary change 
and smoking cessation (Tuomilehto et ah, 2001; Diabetes Prevention Program Research 
Group et ah, 2002; American Diabetes Association et ah, 2014). If this yields insufficient 
glycemic control, pharmacotherapy with glucose-lowering agents (GLAs) like metformin or 
insulin is started (Turner et ah, 1999; American Diabetes Association et ah, 2014). 

Several studies have indicated that one third to one half of T2D patients are undiag¬ 
nosed (Harris et ah, 1998a; King et ah, 1998; Rubin et ah, 1994). Additionally, patients 
often remain undiagnosed for extended periods of time, with average diagnose-free intervals 
ranging from 4 to 7 years (Harris et ah, 1992). The prognosis of untreated patients can 
deteriorate rapidly as prolonged hyperglycemia can cause serious damage to many of the 
body’s systems. Timely diagnosis of T2D proves challenging in contemporary medicine, as 
many patients already present signs of complications of the disease at the time of clinical 
diagnosis of T2D (Harris et ah, 1998b; Rajala et ah, 1998; Kohner et ah, 1998; Ballard 
et ah, 1988; Harris and Eastman, 2000; Hu et ah, 2002). 

Earlier diagnosis and subsequent treatment is believed to prevent or delay complica¬ 
tions and improve prognosis (Pauker, 1993; Engelgau et ah, 2000). When impaired glucose 
tolerance is diagnosed early, initial treatment can often be limited to lifestyle changes (Pan 
et ah, 1997; Tuomilehto et ah, 2001; Diabetes Prevention Program Research Group et ah, 
2002). Compared to pharmacotherapy, lifestyle changes are simple, fully manageable by 
the patient and far less likely to cause serious treatment-induced complications like hypo¬ 
glycemia (Seltzer, 1989; Zammitt and Frier, 2005). Complementary to health benefits, early 
diagnosis of T2D poses a health economical advantage, as patients that do not require acute 
or intensive long-term treatment are far less demanding on the health care system. 

Universal screening for T2D is cost-prohibitive (Wareham and Griffin, 2001; Engelgau 
et ah, 2000), but many organizations advise opportunistic screening of high-risk subgroups 
(World Health Organization et ah, 1994; Alberti et ah, 1998; Engelgau et ah, 2000; Ameri¬ 
can Diabetes Association et ah, 2014). Several risk profiling strategies have been developed 
to aid in the timely diagnosis of T2D (Baan et ah, 1999; Stern et ah, 2002; Lindstrom and 
Tuomilehto, 2003; McNeely et ah, 2003; Charlone et ah, 2004; Heikes et ah, 2008; Schwarz 
et ah, 2009). Risk profiling is typically done by assessing some of the key risk factors for 
T2D, which include obesity (Mokdad et ah, 2003), genetic predisposal (Shai et ah, 2006; 
Consortium et ah, 2013), lifestyle (Reis et ah, 2011) and various clinical parameters. Ex¬ 
isting risk profiling approaches are implemented via questionnaires, potentially augmented 
with clinical information that is available to the patient’s general practitionner (Griffin 
et ah, 2000; Spijkerman et ah, 2004; Lindstrom and Tuomilehto, 2003; Gliimer et ah, 2004; 
Schulze et ah, 2007; Heikes et ah, 2008). Commonly required information includes BMI, 
family history, exercise and smoking habits and various clinical parameters. 
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In this work, we present an alternative approach for risk profiling which only requires 
data that is already available to Belgian mutual health insurers. This work was done in 
collaboration with the National Alliance of Christian Mutualities (NACM). NACM is the 
largest Belgian mutual health insurer with over four million members. Our approach does 
not require any questionnaires or additional clinical information and predicts whether a 
patient will start taking GLAs in the next few years. Interestingly, our approach works well 
despite the fact that Belgian health insurer data contains little direct information regarding 
key risk factors of T2D, that is weight, lifestyle and family history are all unavailable. 

2. Existing type 2 diabetes risk profiling approaches 

The Cambridge Risk Score (CRS) was developed to assess the probability of undiagnosed 
T2D based on data that is routinely available in primary care records, including age, sex, 
medication use, family history of diabetes, BMI and smoking status (Griffin et ah, 2000), 
The CRS has been shown to be useful on multiple occasions (Griffin et ah, 2000; Park et ah, 
2002; Spijkerman et ah, 2004), though its AUC seems to depend heavily on the population 
in which it is used, ranging between 67% (Spijkerman et ah, 2004) and 80% (Griffin et ah, 
2000). The information used in the CRS is comparable to another approach which obtained 
AUCs ranging between 70% and 78% (Baan et ah, 1999). 

The FINDRISC score is based on a 10-year follow-up using age, BMI, waist circum¬ 
ference, history of antihypertensive drugs and high blood glucose, physical activity and 
diet with reported AUCs of 85% and 87% in predicting drug-treated diabetes (Lindstrom 
and Tuomilehto, 2003). The strongest reported predictors in this study were BMI, waist 
circumference, history of high blood glucose and physical activity. Gliimer et ah (2004) 
developed a risk score based on age, sex, BMI, known hypertension, physical activity and 
family history of diabetes with AUC ranging from 72% to 87.6%. The German diabetes 
risk score reached AUCs ranging from 75% to 83% on validation data and is based on age, 
waist circumference, height, history of hypertension, physical activity, smoking, and diet 
(Schulze et ah, 2007). 

Heikes et ah (2008) developed a decision tree for risk prediction achieving 82% AUC in a 
cross-validation setting, based on weight, age, family history and various clinical parameters. 
Various other approaches based on routine clinical information have demonstrated similarly 
accurate predictions of type 2 diabetes (Stern et ah, 2002; McNeely et ah, 2003). 

3. Health expenditure data 

The Belgian health care insurance is a broad solidarity-based form of social insurance. 
Mutual health insurers such as NACM are the legally-appointed bodies for managing and 
providing the Belgian compulsory health care and disability insurance, among other things. 
To implement their operations, Belgian mutual health insurers dispose of large databases 
containing health expenditure records of all their respective members. 

These expenditure records hold all financial reimbursements of drugs, procedures and 
contacts with health care professionals. Each record comprises a timestamp, financial details 
and a description of the claim. The financial aspect is irrelevant from a medical point of 
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view, but the type of resource-use as indicated by the description can contain medical 
information about the patient. These types belong to one of two main categories; 

1. Drug purchases are recorded per package. The coding of packages contains infor¬ 
mation about the active substances in the drug along with the volume of the package. 

2. Medical provisions are identihed by a national encoding along with an identifier of 
the associated medical caregiver. Each provision has a distinct code number. 

In addition to resource-use data, some biographical information is available about each 
patient including age, gender, place of residence and social parameters. In the remainder 
of this Section we will elaborate on expenditure records related to drugs and provisions. 
Subsequently we will briefly summarize the main strengths and limitations of using health 
expenditure data for predictive modeling. 

3.1 Records related to drug purchases 

Expenditure records concerning drug purchases contain information about the active sub¬ 
stances in the drug and the purchased volume. We mapped all active substances onto the 
anatomical therapeutic chemical (ATC) classification system maintained by the WHO Col¬ 
laborating Centre for Drug Statistics Methodology (2015). The ATC classification system 
divides active substances into different groups based on the organ or system on which they 
act and their therapeutic, pharmacological and chemical properties. Each drug is classihed 
in groups at 5 levels in the ATC hierarchy: fourteen main groups (1st level), pharmaco¬ 
logical/therapeutic subgroups (2nd level), chemical subgroups (3rd and 4th level) and the 
chemical substance (5th level). 

After mapping records onto the ATC classification system, a patient’s medication history 
consists of specific ATC codes (5th level) along with the associated number of defined daily 
doses (DDD). In the period of interest, purchases of 4,580 distinct active substances were 
recorded in the NACM database. Table 1 shows an example of the classification of active 
substance on all levels in the ATC system. 


level 

ATC code 

description 

1 

A 

alimentary tract and metabolism 

2 

AlO 

drugs used in diabetes 

3 

AlOB 

blood glucose lowering drugs, excluding insulins 

4 

AlOBA 

biguanides 

5 

A10BA02 

metformin 


Table 1: Example of the ATC classification system: classification of metformin per level. 


3.2 Records related to medical provisions 

Expenditure records concerning medical provisions can be considered tuples containing 
time-stamped identihers of the patient, physician and medical provision. A single patient- 
physician interaction may yield multiple such records, one for each specific provision that 
occurred. 
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In the Belgian health care system, medical provisions are encoded via the Belgian nomen¬ 
clature of medical provisions (Van den Oever and Volckaert, 2008), which is maintained by 
the National Institute for Health and Disability Insurance (NIHDI)d This nomenclature is 
an unstructured list of unique codes (numbers) for each provision that is being refunded. 
Nomenclature numbers are added when new provisions are defined or when revisions are 
made. A single provision may correspond to multiple numbers for various reasons. 

3.3 Advantages of health expenditure data 

The key benefit of expenditure databases is that they centralize structured medical informa¬ 
tion across all medical stakeholders to yield a comprehensive, longitudinal overview of each 
patient’s medical history. Other health data sources are fragmented, e.g. medical records 
maintained by the patient’s general practitioner or hospital often contain only a subset of 
the patient’s medical history. This fragmentation hampers the identification of patterns 
that may indicate elevated risk for diseases like type 2 diabetes. The NACM database 
comprises claims records of over four million Belgians, which enables complex modeling. 
Additionally, claims data have few omissions due to the financial incentive for patients and 
medical stakeholders (hospitals) to claim refunds. While other health data sources may 
contain more detailed information, the strength of NACM’s data is in its volume, both in 
terms of number of patients and the amount of information that is recorded per individual. 
Finally, as most people tend to stay affiliated with the same mutual health insurer, their 
expenditure records provide long-term information. 

3.4 Limitations of health expenditure data 

Belgian health expenditure data is strictly limited to what is required for mutual health 
insurers to implement their operations, which are mainly administrative in nature. Detailed 
health information such as diagnoses and test results are not directly available. In some 
other countries, health insurers dispose of more detailed information, such as ICD-10 codes 
which include diagnoses and symptoms (World Health Organization et ah, 2012). Including 
such information is out of scope of this work as we focus exclusively on data that is already 
available to Belgian mutual health insurers. Biographical information about patients does 
not contain direct information about some important risk factors such as lifestyle, family 
history and BMI, though this may be partially embedded indirectly in medical resource-use. 

4. Methods 

In this Section we define the prediction task and describe all its aspects; the overall setup 
(Section 4.1), the data and its representation (Section 4.2) and the learning algorithms 
(Section 4.3). Briefly, our aim is to predict which patients will start glucose-lowering phar¬ 
macotherapy within the next 4 years, based on expenditure records of the previous 4 years. 

Our key hypothesis is that patients with increased risk for T2D or those that are already 
afflicted but not diagnosed have a different medical expenditure history than patients with¬ 
out impaired glycemic control. We essentially use the start of GLA therapy as a proxy for 


1. The website of NIHDI is available at http://www.riziv.fgov.be 
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diagnosis of (advanced) type 2 diabetes. This is reasonable since most patients that start 
GLA therapy above 40 years old have T2D (World Health Organization et ah, 1994). 

We posed this task as a binary classification problem. Our classifiers produce a numeric 
level of confidence that a given patient will start glucose-lowering pharmacotherapy. When 
predicting a population, the outputs can be used to rank patients according to decreasing 
confidence that the patients will start glucose-lowering therapy. Highly ranked patients 
represent a high-risk subgroup which can be targetted for clinical screening. 

The full learning setup is described in Section 4.1, involving different learning methods 
and representations of patients’ expenditure data. Briefly, we used nested cross-validation to 
obtain unbiased estimates of the predictive performance of each vectorization and learning 
approach. Predictive performance of all models was quantified via (area under) receiver 
operating characteristic (ROC) curves. 

Data Our work is based on a subset of the expenditure records of NACM. All data extrac¬ 
tions and analyses were performed at the Medical Management Department of the NACM 
under supervision of the Chief Medical Officer. The other research partners received no per¬ 
sonally identifiable information (including small cells) from NACM. The patient selection 
and vector representations are described in detail in Section 4.2, 

Class definitions The positive class was dehned as patients that require GLAs for long¬ 
term glycemic control.^ The negative class is then defined as patients that do not need 
GLAs. Expenditure records related to GLAs were used to identify a set of known positives. 
However, the absence of such records in a patient’s resource use history is not proof that 
this patient has no need for GLAs. This subtle difference is crucial, because it is well known 
that patients with impaired glycemic control or T2D often remain undiagnosed and hence 
untreated for a very long time (Harris et ah, 1998a; King et ah, 1998; American Diabetes 
Association et ah, 2014). As we cannot identify negatives, we had to build models from 
positive and unlabeled data. 

PU learning Learning binary classifiers from positive and unlabeled data (PU learning) is 
a well-studied branch of semi-supervised learning (Lee and Liu, 2003; Elkan and Noto, 2008; 
Mordelet and Vert, 2014; Claesen et ah, 2015b). PU learning is more challenging than fully 
supervised binary classification, since it requires special learning approaches and quality 
metrics for hyperparameter optimization that account for the lack of known negatives. We 
benchmarked three PU learning methods, which are discussed in more detail in Section 4.3, 

Software The entire data analysis pipeline was implemented using open-source software. 
Eor general data transformations and preprocessing we used SciPy and NumPy (Jones 
et ah, 2001; Van Der Walt et ah, 2011). The learning algorithms we used are available in 
scikit-learn and EnsembleSVM (Pedregosa et ah, 2011; Claesen et ah, 2014b) . Pinally, we 
used Optunity for automated hyperparameter optimization (Claesen et ah, 2014a). 

4.1 Experimental setup 

We gathered all expenditure records during the 4-year interval of 2008 up to 2012. The se¬ 
lection protocol and representations of patients’ medical resource-use are discussed in detail 

2. GLAs are defined as any drug in ATC category AlO, which includes metformin, sulfonylurea and insulin. 
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in Section 4.2. All vector representations of patients include age (in years), an indicator 
variable for gender and positive entries related to the patient’s medical resource-use. A 
patient vector p can be written in the following general form, where dmeds and dprovs denote 
the number of features in the vectorization of medication and provision use, respectively: 

age gender medication provisions 

M+ {0,1} 

In Sections 4.2.1 and 4.2.2 we explain how records related to medication purchases and 
provisions were represented in vector form. All entries in the vector representations were 
consistently normalized to the interval [0,1] by dividing feature-wise by the 99*^ percentile 
and subsequently clipping where necessary. These normalized vector representations are 
used as inputs for the learning algorithms described in Section 4.3, 

Figure 1 summarizes the full machine learning pipeline, which starts from expendi¬ 
ture records and ends with models to predict whether a patient will start glucose-lowering 
pharmacotherapy along with an estimate of their generalization performance. We used 
nested cross-validation to estimate generalization performance of different learning config¬ 
urations (Varma and Simon, 2006). The outer 3-fold cross-validation is used to estimate 
generalization performance of the full learning approach. Internally, twice iterated 10-fold 
cross-validation was used to find optimal hyperparameters for every learning method. 

Hyperparameter search We used Optunity’s particle swarm optimizer to identify suit¬ 
able hyperparameters for each approach based on the given training set as defined by the 
outer cross-validation procedure (Claesen et ah, 2014a). Every tuple of hyperparameters 
was evaluated using twice iterated 10-fold cross-validation on the training set. Per tech¬ 
nique, the hyperparameters that maximized cross-validated performance were selected and 
used to train a model on the full training set. 

Model evaluation Models are compared based on area under the ROC curve. ROC 
curves visualize a classifier’s performance spectrum by depicting its true positive rate 
(TPR)^ as a function of its false positive rate (FPR)^ while varying the decision threshold 
to decide on positives. Area under the ROC curve (AUROC) is a useful summary statistic 
of a classifier’s performance. AUROC is equal to the probability that the classifier ranks 
a random positive higher than a random negative and is known to be equivalent to the 
Wilcoxon test of ranks (Hanley and McNeil, 1982). 

Computing ROC curves Full label knowledge is required to compute ROC curves. In 
previous work, we introduced a method to compute bounds on ROC curves based on positive 
and unlabeled data (Claesen et ah, 2015a). Briefly, it is based on the positions of known 
positives in a ranking produced by a given classifier and requires two things; 

• The rank distributions of labeled and latent positives must be comparable. This holds 
when known and latent positives follow the same distribution in input space (ie. the 
vector representation of patients). This is a fair assumption in our application, since 
we specifically ignore records after the start of glucose-lowering pharmacotherapy 

3. TPR measures the fraction of true positives that are correctly identified by the classifier. 

4. FPR measures the fraction of true negatives that are incorrectly identified by the classifier. 


p e 


p 2-t-dmeds Ydprovs _ 
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] outer cross-validation (3-fold on XscaZed arid y): 

1. determine folds based on labels y 
I 2. loop: 

] • determine train and test data , X(°“j 

• build model on with associated labels - ■ 

• test model on with associated labels 

[ 3. aggregate performance estimates across folds 


i 

performance estimates 
ROC and PR curves 


optimized models 

feature importance 


1 . 


2 . 


determine optimal hyperparameters A* 

• optimize area under ROC curve 

• estimated using 2 x 10-fold 

cross-validation on and 

its associated labels 

• via particle swarm optimization 
train model with A* on 


Figure 1: Overview of the full learning approach: data set vectorization, normalization and 
the nested cross-validation setup. Per iteration, hyperparameter optimization 
and model training is done based exclusively on . 


while identifying the set of positives (see Section 4.2), so the medication regimen of 
known positives has not yet diverged from the regimen of untreated patients. 

• An estimate (3 of the fraction of latent positives in the unlabeled set is needed, that 
is the fraction of members that have never used GLAs but are likely to start glucose¬ 
lowering pharmacotherapy. In the period 2010-2014 roughly 8% of members of NACM 
aged 40 or higher started using GLAs. Underestimating (3 results in an underestimated 
ROG curve and vice versa (Claesen et ah, 2015a). We opted to be conservative and 
used f3io = 5% to estimate lower bounds and f3up = 10% for upper bounds. 

We consistently used the lower bounds for hyperparameter search. All our performance 
reports contain lower and upper bounds, based on f3io and I3up, respectively. 
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Diagnosing overfitting In addition to measuring performance, we diagnosed overfitting 
via the concept of rank distributions as defined by Claesen et al. (2015a). The rank distri¬ 
bution of a subset of test instances is defined as the distribution of the positions of these 
test instances in a ranking of the full test set based on a model’s predicted decision values. 
We diagnose overfitting based on the rank distributions of known positive training instances 
i'Ptrain) and known positives in the independent test fold {Vtest) after predicting the full 
data set. If the model overfits, the rank distribution of Vtrain is inconsistent with the rank 
distribution of Vtest- Specifically, ranks in Vtest are worse than those in Vtrain when the 
model overfits. This can be quantified via the Mann-Whitney U test (Mann and Whitney, 
1947) based on ranks of Vtrain and Vtest after predicting the full data set (that is all outer 
folds). The Mann-Whitney U test is expected to yield a non-significant result when the 
rank distributions of Vtrain and Vtest are comparable. We report the average p-values of the 
test across outer cross-validation folds for each model (low p-values indicate overfitting). 

4.2 Data Set Construction 

We constructed a data set containing records of patients born before 1973 (e.g. 40 or more 
years old in 2012). Patients with records of glucose-lowering agents (GLAs) during less 
than 30 days were discarded. Patients with records of glucose-lowering therapy prior to 
2012 were discarded. Patients that joined NACM after 2005 were also discarded, as we 
cannot determine whether these patients used GLAs in the recent past. 

All patients that started glucose-lowering pharmacotherapy in 2012 or later are included 
as known positives (n = 31,066), along with unlabeled patients that were sampled at ran¬ 
dom from the remaining NACM members (n = 79, 243). Known positives have a minimum 
of 30 days between the first and last purchase of GLAs to avoid contaminating the data set 
with false positives, for instance due to insulin use in surgical and medical ICUs (Van den 
Berghe et ah, 2001, 2006). It must be noted that some false positives remain, that is patients 
that use GLAs but not for glycemic control. 

In Sections 4.2.1 and 4.2.2 we describe the vector representations of records regarding 
medication and medical provisions, respectively. 

4.2.1 Representation of medication records 

The simplest way to represent medication purchases during a time interval is by having 
one input dimension per active substance (level 5 ATC codes) and counting the purchased 
volume in terms of DDDs. This representation is easy to construct but fails to capture any 
similarity between active substances, such as the system or organ on which they act. 

Imposing structure We can directly use the hierarchical structure of the ATC system 
to define a measure of similarity between drugs. To impose structure between drugs we 
included input dimensions related to more generic levels of the ATC hierarchy (levels 1 to 4). 
On more generic levels we summed all DDD counts of active substances per category (level 
5). This redundancy allowed us to express similarity between different active substances 
with a standard inner product. By normalizing every feature to the unit interval, we 
obtained the desired effect that patients with comparable drug use on ATC level 5 are more 
similar than patients that only share coefficients on more generic levels. Figure 2 illustrates 
this vector representation of trees and the effect of normalization. 
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Va = [ [l.O 0.0 3.0 2.0 1.0 o.o) [l.O 5.0 l.O] fTiol ] “p= [ [O.O 2.0 1.0 2.0 0.0 2.ol (2.0 3.0 2.0] fTT) ] 

-S = [ [1.0 2.0 3.0 2.0 1.0 2'F)^2.0 5.0 2.0)17!^ ] - 

VX = [ [1.0 0.0 1.0 1.0 1.0 0.0) 0.5 1.0 0.5 fro) ] V*B = [ [0.0 1.0 0.3 1.0 0.0 1.0) 1.0 0.6 1.0 fr?) ] 


Figure 2: Visualization and vectorization of trees. In the tree representation, the value of 
internal nodes is the sum of the values of its children. The unnormalized vector 
representations Va and Vb contain the values per node in the tree representation 
in some hxed order. Inner products between unnormalized representations Va 
and Vb are mainly influenced by the top level nodes, since those have the largest 
value by construction. This undesirable effect can be hxed through feature-wise 
scaling. The scaling vector S was constructed using node-wise maxima. The 
normalized vector representations and are obtained by dividing the vector 
representations (Va, Vb) element-wise by entries in the scaling vector S. V\ and 
V^ are used as input to classihers in the remainder of this work. As desired, the 
inner product of normalized vector representations is increasingly inhuenced by 
similarities at higher depths in the tree representations. 


Summary All vectorizations related to drug purchases are described in Table 2. 


vectorization 

description 

^meds 

ATC 5 

counts of DDDs per medication class in ATC level 5 

4,580 

ATC 1-4 

counts of DDDs per medication class in ATC levels 1-4 

1,257 

ATC 1-5 

counts of DDDs per medication class in ATC levels 1-5 

5,837 


Table 2: Summary of vectorization schemes used for records of drug purchases. 


4.2.2 Representation of provision records 

When considering a specihc time period, we can describe records by a (sparse) three- 
dimensional tensor containing frequency counts as illustrated in Figure 3. We hltered all 
provisions with a description containing diabetes, insulin and glucose and provisions not 
recorded with a physician identiher. After hltering, 5,799 distinct provision codes remain 
(denoted by j^provisions). 
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S over patients 



PHYSICIAN MATRIX 


^provisions x ^physicians 
5,799 X 118,554 


Figure 3: Tensor formulation of medical provisions with three components: patients, physi¬ 
cians and provisions. Each entry in the tensor is the frequency of the given tuple. 
This provision tensor is very sparse. The patient matrix is obtained by sum¬ 
ming counts over all physicians (transposed). The physician matrix is obtained 
by summing counts over all patients. These matrices capture complementary 
information. 


Each patient is modelled by a histogram of their provisions in the period of interest. 
This essentially means we compute the sum over the p/iysicran-component of the tensor 
representation to obtain a matrix, in which rows and columns represent patients and pro¬ 
visions, respectively. Unfortunately, the encoding of provisions has no medically relevant 
structure in contrast to the ATC hierarchy for drngs as discussed in Section 4.2.1. 

Imposing structure In order to define a reasonable similarity measure between patients, 
we first had to impose a structure onto the nomenclature that captures similarity between 
provisions. To structure provisions, we should not use information originating from the 
patient matrix, as this may cause information leaks (since the patient matrix is used directly 
in our models for prediction). Instead, we used the complementary physician matrix as a 
basis to define similarity between provisions, which essentially serves as a proxy for the 
medical specializations to which each provision belongs. We started from cosine similarity 
between nomenclature codes based on the physician matrix. We used cosine similarity 
because it is known to work well for text mining with bag-of-words representations, which 
is comparable to our use case as it also features sparse, high dimensional input spaces. The 
cosine similarity Kcos between two row vectors u and v is defined as: 

(u, v) _ uv'^ 
u|| • ||v|| ||u|| • ||v 

Using cosine similarity we can constrnct a pair-wise similarity matrix Sprov between provi¬ 
sions based on the rows of the physician matrix Xj, i = .^provisions: 

C — (^ V \\ c laii^provisionsxi^provisiims /q\ 

^prov — \^cos\^Z'! t 

Sprov expresses similarity between provision codes based on the physicians that provide 
them and can be regarded as a proxy for the medical subdomain each provision frequently 
occurs in. In our context, its entries range from 0 (completely orthogonal) to -|-I (exact 
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similarity). To impose sparsity we set all entries of Sprov below 0.05 to 0. Its structure is 
visualized in Figure 4, which clearly indicates that our approach successfully identifies some 
coherent groups of provisions. 




Figure 4: Structure of the provision similarity matrix Sprov based on providing physicians. 

Finally, the structured representation of provisions Pstruct is defined as the matrix prod¬ 
uct between the patient matrix P fiat and the provision similarity matrix Sp^-oj;: 

P struct = P flat X Sprov G R#P-ii-nMprovisions ^ ( 4 ) 

Pstruct approximately captures which provisions occur in a patient’s history with redun¬ 
dancy based on medical specializations. 

Summary All vectorizations related to medical provisions are described in Table 3, 


vectorization 

symbol 

description 

^provs 

PROVS flat 

PROVS STRUCT 

PROVS BOTH 

P flat 

P struct 

P flat 1 P struct 

entries taken from the patient matrix 
captures similarity between provisions 
concatenation of flat & structured 

5,799 

5,799 

11,598 


Table 3: Summary of vectorization schemes used for records of medical provisions. 


4.3 Modeling approaches 

Having only positive and unlabeled data (PU learning) presents additional challenges for 
learning algorithms. Two broad classes of approaches exist to tackle these problems; (i) 
two-phase methods that first attempt to identify likely negatives from the unlabeled set and 
then train a supervised model on the positives and inferred negatives (Liu et al., 2002; Yu, 
2005) and (ii) approaches that treat the unlabeled set as negatives with label noise (Elkan 
and Noto, 2008; Lee and Liu, 2003; Mordelet and Vert, 2014; Claesen et ah, 2015b). 

We have tested three approaches from the latter category in this work, namely class- 
weighted SVM (Liu et ah, 2003), bagging SVM (Mordelet and Vert, 2014) and the robust 
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ensemble of SVM models (Claesen et al., 2015b). All of these approaches are based on 
support vector machines. We used the linear kernel on vector representations of patients as 
described in Section 4.2 ^ We will briefly introduce each method in the following subsections. 


4.3.1 Class-WEIGHTED SVM 

Class-weighted SVM (CWSVM) uses a misclassification penalty per class. CWSVM was 
first applied in a PU learning context by Liu et al. (2003), by considering the unlabeled set 
to be negative with noise on its labels. A CWSVM is trained to distinguish positives (V) 
from unlabeled instances (U), leading to the following optimization problem; 


^ N N 

2 EE aiajyiVjKixi, x^) + 

’ i=l j = l i£U 

N 

s.t. ViC^ ajyjui^i, :s.j) + b)>l-(i, 
i=i 

?! > 0 , 


( 5 ) 


i = 1,... ,N, 
i = I,... ,N, 


where a G are the support values, y G {—1, 4-1}^ is the label vector, k{-, •) is the kernel 
function, b is the bias term and ^ G are the slack variables for soft-margin classification. 
The misclassification penalties Cp and Cu require tuning. We used the implementation 
available in scikit-learn (Pedregosa et ah, 2011) based on LIBSVM (Chang and Lin, 2011). 


4.3.2 Bagging SVM 


In bagging SVM, random resamples are drawn from the unlabeled set and CWSVM classi¬ 
fiers are trained to discriminate all positives from each resample (Mordelet and Vert, 2014). 
Resampling the unlabeled set induces variability in the base models which is exploited via 
bagging. Base model predictions are aggregated via majority voting. 

Bagging SVM with linear base models has two hyperparameters, namely the size of 
resamples of the unlabeled set nu and the misclassification penalty on unlabeled instances 
Cu- The misclassification penalty on positives Cp is fixed via the following rule; 


_ nu X Cu 
\V\ 


( 6 ) 


where \V\ denotes the number of known positives. The heuristic rule in Equation 6 is 
common in imbalanced settings (Cawley, 2006; Daemen et ah, 2009). We implemented 
bagging SVM using the EnsembleSVM library (Claesen et ah, 2014b). 


4.3.3 Robust ensemble of SVM models 

The robust ensemble of SVM models (RESVM) is a modified version of bagging SVM in 
which both the positive and unlabeled sets are resampled when constructing base model 
training sets (Claesen et ah, 2015b). The extra resampling induces additional variability 
between base models which improves performance when combined with a majority vote 

5. Though it must be noted that the ensemble methods are always implicitly nonlinear. 
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aggregation scheme. Claesen et al. (2015b) demonstrated that resampling the positive 
set provides robustness against false positives, which makes RESVM appealing for our 
application since our data set is known to contain a small fraction of false positives (as 
explained in Section 4.2). 

When using linear base models, the RESVM approach has four hyperparameters that 
must be tuned, namely resample sizes and misclassification penalties per class. This ap¬ 
proach was implemented based on EnsembleSVM (Claesen et ah, 2014b). 

5. Results and discussion 

Section 5.1 shows the predictive performance per learning configuration and compares these 
performances to the current state-of-the-art in large-scale risk assessment for T2D. Sec¬ 
tion 5.2 shows performance curves of the best configuration, which enable us to determine 
suitable cutoffs to identify target groups in practice. Einally, Section 5.3 describes a simple 
approach to assess which features contribute most to risk according to our best models. 

5.1 Benchmark of learning methods 

Table 4 summarizes the performance of each learning configuration. The AGE,gender 
feature set provides a baseline for comparison, all other feature sets include these as well. As 
shown in the results, this two-dimensional representation already carries some information. 


features 

RESVM 


bagging SVM 

class-weighted SVM 

AUROC (%) 

P 

AUROC(%) 

P 

AUROC(%) 

P 

AGE, GENDER 

55.74-56.64 

* 

58.61-59.67 


60.96 62.21 

0.04 

ATC 5 

72.55-74.43 

0.17 

70.83-72.62 

0.09 

71.89-73.74 

0.01 

ATC 1-4 

73.12 75.07 

0.07 

69.57-71.24 


73.05-74.91 

0.04 

ATC 1-5 

74.34 76.27 

0.13 

71.50-73.27 

0.05 

72.13-73.94 


PROVS FLAT 

58.45-59.51 

* 

60.74-61.92 


63.01-64.31 


PROVS STRUCT 

57.40-58.39 

0.02 

59.53-60.58 

0.01 

62.53-63.81 

0.01 

PROVS BOTH 

58.89-59.75 

* 

61.72-62.87 


63.45 64.75 


ATC PROVS 

74.89-76.82 

0.04 

69.72-71.40 


73.77-75.64 



Table 4; Average bounds on area under the ROC curve and p-value of the Mann-Whitney 
U test over all folds for different feature sets per learning approach in a long¬ 
term prediction setup. The lower and upper bounds on AUC were computed 
with /3io = 0.05 and Pup = 0.10, respectively. The atg | PROVS feature set is the 
concatenation of the best performing sets per aspect, namely ATC 1—5 and provs 
BOTH. Stars (*) denote p-values below 0.005. 


Based on Table 4 we can conclude that a patient’s medication history is highly informa¬ 
tive to predict the start of GLA therapy. Using features based on ATC level 5, the RESVM 
model obtained an AUC between 72.55% and 74.43%. By adding redundancy as described 
in Section 4.2.1 the performance based on medication history alone was further increased 
to between 74.34% and 76.27% for the best learning approach (RESVM). 


14 



Predicting the Start of Glucose-Lowering Pharmacotherapy 


Predictive performance based on provisions alone turned out fairly poor, showing only a 
mild improvement compared to models based exclusively on age and gender for all learning 
algorithms. Interestingly, the best approach for representations based on provisions was 
class-weighted SVM, with RESVM being worst of all three learning methods. It appears 
that for these representations, large training sets are better; class-weighted SVM uses the 
full training set, bagging SVM uses all positives and a subset of unlabeleed instances per 
base model and RESVM uses (small) subsets of both positives and unlabeled instances per 
base model. 

The best representation included age, gender, and structured information about drugs 
and provision history of the patient. The best learning method on this representation was 
RESVM, achieving an AUC between 74.89% and 76.82%. In Section 5.1.1 we compare the 
performance of our approach to competing screening methods. 

Finally, RESVM appears most resistant to overfitting in the hyperparameter optimiza¬ 
tion stage as it consistently exhibits the highest average p-values in our diagnostic test 
(higher is better, see Section 4.1). We believe this to be attributable to the use of small 
resamples of both positives and unlabeled instances when training base models in RESVM, 
since this makes it unlikely to obtain a structural overfit of the ensemble model on the full 
training set. In contrast, bagging SVM is far more prone to overfitting because every base 
model is trained on all positives. 

5.1.1 Comparison to state-of-the-art 

Our best approach obtained cross-validated AUC between 74.89% and 76.82% (exact num¬ 
bers are unknown due to the lack of known negatives). This is comparable to many com¬ 
peting approaches, based on questionnaires and some clinical information such as the Cam¬ 
bridge Risk Score (AUC 67%-80%, (Spijkerman et ah, 2004; Griffin et ah, 2000)), the Danish 
risk score (AUC 72%-87.6%, (Gliimer et ah, 2004)), the German diabetes risk score (AUC 
75%-83%, (Schulze et ah, 2007)) and a Dutch approach (AUC 74%, (Baan et ah, 1999)). 
Approaches using detailed clinical information generally perform better, but are more ex¬ 
pensive to maintain (Stern et ah, 2002; McNeely et ah, 2003; Lindstrom and Tuomilehto, 
2003; Heikes et ah, 2008). They key advantage of our approach is the fact it is easy to 
implement on a population wide scale at virtually no operational cost. 

The target class we used in this work is stricter than in the risk prediction methods 
mentioned in Section 2, namely patients that require GLAs for glycemic control versus 
patients with impaired glycemic control, respectively (except for Lindstrom and Tuomilehto 
(2003), which also predicted drug-treated T2D). It is reasonable to assume that our models 
generally rank patients with impaired glycemic control but without a need for GLAs higher 
than patients without impaired glycemic control. In our performance assessment both 
of these patient groups are essentially treated as negatives, in contrast to the screening 
programmes mentioned previously which treat patients with impaired glycemic control as 
positives. Hence, we believe the performance of our models would appear higher when 
evaluated against a target class comprising all patients with impaired glycemic control, as 
is done in the evaluation of other screening approaches. Unfortunately, we are unable to 
accurately identify patients with impaired glycemic control but without need for GLAs. 
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All competing methods use either clinical information or direct knowledge of risk factors 
that is unavailable to us. Furthermore, the characteristics that are lacking in our data have 
been reported to be the most informative to assess risk for T2D (Lindstrom and Tuomilehto, 
2003; Stern et ah, 2002; McNeely et ah, 2003). We obtained generalization performances 
that are comparable to existing approaches, despite these missing predictors. Finally, our 
approach is the only one that is based exclusively on existing data that is always available, 
without requiring additional patient contacts or clinical tests. 

5.2 Receiver Operating Characteristic curves for RESVM 

The RESVM model based on ATC | PROVS vectorization had the best overall performance. 
Figure 5 shows bounds on the ROC and PR curves for this model. These bounds were com¬ 
puted using the technique described by Claesen et al. (2015a). The true curve is unknown 
because we do not dispose of negative labels. 

ROC curves enable us to determine a cutoff to use in practice, based on a suitable 
balance between true and false positive rate (sensitivity and 1-specificity, respectively). 
Determining a suitable balance requires a tradeoff between the relative importance of iden¬ 
tifying undiagnosed patients (true positives) vis-a-vis increased amounts of screening tests 
on patients that are in fact healthy (false positives). 




(a) Receiver Operating Characteristic curve. (b) Precision-Recall curve. 

Figure 5: Performance curves for the best model: RESVM classifier based on ATC | PROVS 
vectorization. The lower and upper bounds are estimated using /3/o = 5% and 
Pup = 10%, respectively. 


It should be noted that precision depends on class balance, and therefore the PR curve 
shown in Figure 5(b) is not representative for screening an overall population, since the 
overall population has a higher fraction of negatives than our custom data set (i.e. precision 
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would be lower in practice). In contrast, the bounds in ROC space are representative because 
ROC curves are insensitive to changes in class distribution (Fawcett, 2006). 

5.3 Feature importance analysis for the RESVM model 

The RESVM model is implicitly nonlinear due to its majority voting rule to aggregate 
base model decisions, which poses problems in assessing the importance of each predictor. 
However, our use of linear base models enables a simple approximation. The decision value 
for base model i G {1,..., nmodeis} for a test instance z can be written as follows; 

fi{z) = (wi,z) + Pi, 

where Wj is the separating hyperplane and pi is a bias term. A simple linear approximation 
of such ensemble models can be computed as the average of all base model hyperplanes: 

^models 

W = ^ Wj/nmodels- 

i=l 

Feature importance can then be determined based on the coefficients in w. Since we nor¬ 
malized all features to the unit interval [0,1] we can conclude that the features with largest 
(positive) coefficients contribute most to risk as identified by our model. 

Via this approach, the risk associated to use of cardiovascular medication (ATC main 
category c) far outweights all other ATC main categories. This is not surprising, as diabetes 
is known to be strongly related to cardiovascular problems (Kannel and McGee, 1979; 
Grundy et ah, 1999; Hu et ah, 2002). The relative importance of features will be discussed 
in detail in a subsequent medical paper. 

6. Conclusion 

In this work we have demonstrated the ability to predict clinical outcomes based solely 
on readily available health expenditure data. We successfully built proof-of-concept clas¬ 
sifiers to predict the start of glucose-lowering pharmacotherapy in patients above 40. Our 
experiments show that accurate predictions can be made based on historical medication 
purchases. These predictions can be further improved by incorporating information about 
medical provisions and the use of appropriate vectorization schemes. 

Since adult patients starting glucose-lowering pharmacotherapy are mainly afflicted with 
type 2 diabetes (T2D), our models can be used for T2D risk assessment. Our approach 
presents a novel method for case finding which can be easily incorporated in modern health¬ 
care, since all required data is already available. The associated operational costs are very 
low as the entire workflow can be fully automated without any need for patient contacts or 
medical tests. As such, our work provides an efficient and cost-effective method to identify 
a high risk subgroup, which can then be screened using decisive clinical tests. 

Interestingly, our approach works well even though health expenditure data contains 
very limited direct information on some important known risk factors. In that sense, our 
approach is fundamentally different from the current state-of-the-art which mainly focuses 
on quantifying known risk factors directly, either by asking the patient or through clinical 
tests. The performance of our approach is expected to improve further when additional 
information about these risk factors can be obtained, e.g. family history and lifestyle. 
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